Abstract

We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance 
that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal 
running time exponent. Unlike earlier algorithms with this property (e.g., Spherical LSH [1, 2]), 
our algorithm is also practical, improving upon the well-studied hyperplane LSH [3] in practice. 
We also introduce a multiprobe version of this algorithm, and conduct experimental evaluation 
on real and synthetic data sets. 

We complement the above positive results with a fine-grained lower bound for the quality 
of any LSH family for angular distance. Our lower bound implies that the above LSH family 
exhibits a trade-off between evaluation time and quality that is close to optimal for a natural 
class of LSH functions. 


1 Introduction 

Nearest neighbor search is a key algorithmic problem with applications in several fields including
computer vision, information retrieval, and machine learning [4]. Given a set of n points $P \subset \mathbb{R}^d$,
the goal is to build a data structure that answers nearest neighbor queries efficiently: for a given
query point $q \in \mathbb{R}^d$, find the point $p \in P$ that is closest to q under an appropriately chosen distance
metric. The main algorithmic design goals are usually a fast query time, a small memory footprint,
and, in the approximate setting, a good quality of the returned solution.

There is a wide range of algorithms for nearest neighbor search based on techniques such as 
space partitioning with indexing, as well as dimension reduction or sketching [5]. A popular method 
for point sets in high-dimensional spaces is Locality-Sensitive Hashing (LSH) [6, 3], an approach 
that offers a provably sub-linear query time and sub-quadratic space complexity, and has been 
shown to achieve good empirical performance in a variety of applications [4]. The method relies 
on the notion of locality-sensitive hash functions. Intuitively, a hash function is locality-sensitive if 
its probability of collision is higher for “nearby” points than for points that are “far apart”. More 
formally, two points are nearby if their distance is at most $r_1$, and they are far apart if their distance
is at least $r_2 = c \cdot r_1$, where $c > 1$ quantifies the gap between “near” and “far”. The quality of a
hash function is characterized by two key parameters: $p_1$ is the collision probability for nearby
points, and $p_2$ is the collision probability for points that are far apart. The gap between $p_1$ and $p_2$
determines how “sensitive” the hash function is to changes in distance, and this property is captured
by the parameter $\rho = \frac{\log(1/p_1)}{\log(1/p_2)}$, which can usually be expressed as a function of the distance gap c.

*The authors are listed in alphabetical order.

The problem of designing good locality-sensitive hash functions and LSH-based efficient nearest 
neighbor search algorithms has attracted significant attention over the last few years. 

In this paper, we focus on LSH for the Euclidean distance on the unit sphere, which is an 
important special case for several reasons. First, the spherical case is relevant in practice: Euclidean 
distance on a sphere corresponds to the angular distance or cosine similarity, which are commonly 
used in applications such as comparing image feature vectors [7], speaker representations [8], and 
tf-idf data sets [9]. Moreover, on the theoretical side, the paper [2] shows a reduction from Nearest 
Neighbor Search in the entire Euclidean space to the spherical case. These connections lead to a 
natural question: what are good LSH families for this special case? 

On the theoretical side, the recent work of [1, 2] gives the best known provable guarantees for
LSH-based nearest neighbor search w.r.t. the Euclidean distance on the unit sphere. Specifically,
their algorithm has a query time of $O(n^{\rho})$ and space complexity of $O(n^{1+\rho})$ for $\rho = \frac{1}{2c^2 - 1}$.^1 E.g., for
the approximation factor c = 2, the algorithm achieves a query time of $n^{1/7 + o(1)}$. At the heart of the
algorithm is an LSH scheme called Spherical LSH, which works for unit vectors. Its key property
is that it can distinguish between distances $r_1 = \sqrt{2}/c$ and $r_2 = \sqrt{2}$ with probabilities yielding
$\rho = \frac{1}{2c^2 - 1}$ (the formula for the full range of distances is more complex and given in Section 3).
Unfortunately, the scheme as described in the paper is not applicable in practice as it is based on
rather complex hash functions that are very time consuming to evaluate. E.g., simply evaluating
a single hash function from [2] can take more time than a linear scan over $10^6$ points. Since an LSH
data structure contains many individual hash functions, using their scheme would be slower than
a simple linear scan over all points in P unless the number of points n is extremely large.

On the practical side, the hyperplane LSH introduced in the influential work of Charikar [3]
has worse theoretical guarantees, but works well in practice. Since the hyperplane LSH can be
implemented very efficiently, it is the standard hash function in practical LSH-based nearest neighbor
algorithms,^2 and the resulting implementations have been shown to improve over a linear scan on real
data by multiple orders of magnitude [14, 9].

The aforementioned discrepancy between the theory and practice of LSH raises an important 
question: is there a locality-sensitive hash function with optimal guarantees that also improves over 
the hyperplane LSH in practice? 

In this paper we show that there is a family of locality-sensitive hash functions that achieves 
both objectives. Specifically, the hash functions match the theoretical guarantee of Spherical LSH 
from [2] and, when combined with additional techniques, give better experimental results than the 
hyperplane LSH. More specifically, our contributions are: 

^1 This running time is known to be essentially optimal for a large class of algorithms [10, 11].

^2 Note that if the data points are binary, more efficient LSH schemes exist [12, 13]. However, in this paper we
consider algorithms for general (non-binary) vectors.

Theoretical guarantees for the cross-polytope LSH. We show that a hash function based
on randomly rotated cross-polytopes (i.e., unit balls of the $\ell_1$-norm) achieves the same parameter $\rho$
as the Spherical LSH scheme in [2], assuming data points are unit vectors. While the cross-polytope
LSH family has been proposed by researchers before [15, 16], we give the first theoretical analysis of
its performance.

Fine-grained lower bound for cosine similarity LSH. To highlight the difficulty of obtaining
optimal and practical LSH schemes, we prove the first non-asymptotic lower bound on the trade-off
between the collision probabilities $p_1$ and $p_2$. So far, the optimal LSH upper bounds $\rho = \frac{1}{2c^2 - 1}$ (from
[1, 2] and cross-polytope from here) attain this bound only in the limit, as $p_1, p_2 \to 0$. Very small $p_1$
and $p_2$ are undesirable since the hash evaluation time is often proportional to $1/p_2$. Our lower
bound proves this is unavoidable: if we require $p_2$ to be large, $\rho$ has to be suboptimal.

This result has two important implications for designing practical hash functions. First, it shows 
that the trade-offs achieved by the cross-polytope LSH and the scheme of [1, 2] are essentially 
optimal. Second, the lower bound guides design of future LSH functions: if one is to significantly 
improve upon the cross-polytope LSH, one has to design a hash function that is computed more 
efficiently than by explicitly enumerating its range (see Section 4 for a more detailed discussion). 

Multiprobe scheme for the cross-polytope LSH. The space complexity of an LSH data
structure is sub-quadratic, but even this is often too large (i.e., strongly super-linear in the number
of points), and several methods have been proposed to address this issue. Empirically, the most
efficient scheme is multiprobe LSH [14], which leads to a significantly reduced memory footprint
for the hyperplane LSH. In order to make the cross-polytope LSH competitive in practice with the
multiprobe hyperplane LSH, we propose a novel multiprobe scheme for the cross-polytope LSH.

We complement these contributions with an experimental evaluation on both real and synthetic 
data (SIFT vectors, tf-idf data, and a random point set). In order to make the cross-polytope LSH 
practical, we combine it with fast pseudo-random rotations [17] via the Fast Hadamard Transform, 
and feature hashing [18] to exploit sparsity of data. Our results show that for data sets with around
$10^5$ to $10^8$ points, our multiprobe variant of the cross-polytope LSH is up to 10× faster than an
efficient implementation of the hyperplane LSH, and up to 700× faster than a linear scan. To
the best of our knowledge, our combination of techniques provides the first “exponent-optimal”
algorithm that empirically improves over the hyperplane LSH in terms of query time for an exact
nearest neighbor search.

1.1 Related work 

The cross-polytope LSH functions were originally proposed in [15]. However, the analysis in that 
paper was mostly experimental. Specifically, the probabilities $p_1$ and $p_2$ of the proposed LSH
functions were estimated empirically using the Monte Carlo method. Similar hash functions were
later proposed in [16]. The latter paper also uses DFT to speed up the random matrix-vector
multiplication operation. Both of the aforementioned papers consider only the single-probe
algorithm.

There are several works that show lower bounds on the quality of LSH hash functions [19, 10, 20,
11]. However, those papers provide only a lower bound on the $\rho$ parameter for asymptotic values of
$p_1$ and $p_2$, as opposed to an actual trade-off between these two quantities. In this paper we provide
such a trade-off, with implications as outlined in the introduction.


2 Preliminaries 


We use $\|\cdot\|$ to denote the Euclidean (a.k.a. $\ell_2$) norm on $\mathbb{R}^d$. We also use $S^{d-1}$ to denote the unit
sphere in $\mathbb{R}^d$ centered at the origin. The Gaussian distribution with mean zero and variance of
one is denoted by $N(0,1)$. Let $\mu$ be a normalized Haar measure on $S^{d-1}$ (that is, $\mu(S^{d-1}) = 1$).
Note that $\mu$ corresponds to the uniform distribution over $S^{d-1}$. We also let $u \sim S^{d-1}$ be a point
sampled from $S^{d-1}$ uniformly at random. For $\eta \in \mathbb{R}$ we denote

$$\Phi_c(\eta) = \Pr_{X \sim N(0,1)}[X \ge \eta] = \frac{1}{\sqrt{2\pi}} \int_{\eta}^{\infty} e^{-t^2/2} \, dt.$$

We will be interested in the Near Neighbor Search on the sphere $S^{d-1}$ with respect to the
Euclidean distance. Note that the angular distance can be expressed via the Euclidean distance
between normalized vectors, so our results apply to the angular distance as well.

Definition 1. Given an n-point dataset $P \subset S^{d-1}$ on the sphere, the goal of the (c, r)-Approximate
Near Neighbor problem (ANN) is to build a data structure that, given a query $q \in S^{d-1}$ with the
promise that there exists a datapoint $p \in P$ with $\|p - q\| \le r$, reports a datapoint $p' \in P$ within
distance $cr$ from q.

Definition 2. We say that a hash family $\mathcal{H}$ on the sphere $S^{d-1}$ is $(r_1, r_2, p_1, p_2)$-sensitive, if for
every $x, y \in S^{d-1}$ one has $\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \ge p_1$ if $\|x - y\| \le r_1$, and
$\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \le p_2$ if $\|x - y\| \ge r_2$.

It is known [6] that an efficient $(r, cr, p_1, p_2)$-sensitive hash family implies a data structure for
(c, r)-ANN with space $O(n^{1+\rho}/p_1 + dn)$ and query time $O(d \cdot n^{\rho}/p_1)$, where $\rho = \frac{\log(1/p_1)}{\log(1/p_2)}$.
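To make the role of the exponent concrete, the following minimal sketch evaluates $\rho$ and the two bounds quoted from [6], ignoring constant factors. The function names `lsh_exponent` and `lsh_costs` are hypothetical and are introduced only for this illustration.

```python
import math
from typing import Tuple


def lsh_exponent(p1: float, p2: float) -> float:
    """rho = log(1/p1) / log(1/p2) for collision probabilities 0 < p2 < p1 < 1."""
    if not (0.0 < p2 < p1 < 1.0):
        raise ValueError("expected 0 < p2 < p1 < 1")
    return math.log(1.0 / p1) / math.log(1.0 / p2)


def lsh_costs(n: int, d: int, p1: float, p2: float) -> Tuple[float, float]:
    """Return (space, query_time) from the bounds above, with constants dropped."""
    rho = lsh_exponent(p1, p2)
    space = n ** (1.0 + rho) / p1 + d * n
    query_time = d * n ** rho / p1
    return space, query_time


if __name__ == "__main__":
    # A family with p1 = 1/2 and p2 = 1/4 has rho = 1/2.
    print(lsh_exponent(0.5, 0.25))
    print(lsh_costs(n=10**6, d=128, p1=0.5, p2=0.25))
```

A smaller $\rho$ lowers both exponents at once, which is why the remainder of the paper compares hash families through this single parameter.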


3 Cross-polytope LSH 

In this section, we describe the cross-polytope LSH, analyze it, and show how to make it practical. 
First, we recall the definition of the cross-polytope LSH [15]: Consider the following hash family
$\mathcal{H}$ for points on a unit sphere $S^{d-1} \subset \mathbb{R}^d$. Let $A \in \mathbb{R}^{d \times d}$ be a random matrix with i.i.d. Gaussian
entries (“a random rotation”). To hash a point $x \in S^{d-1}$, we compute $y = Ax/\|Ax\| \in S^{d-1}$ and
then find the point closest to y from $\{\pm e_i\}_{1 \le i \le d}$, where $e_i$ is the i-th standard basis vector of $\mathbb{R}^d$.
We use the closest neighbor as a hash of x.
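The definition above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `gaussian_matrix`, `cross_polytope_hash` and `collision_rate` are hypothetical, the matrix is dense (so hashing costs time proportional to $d^2$), and the hash value is encoded as a pair (index, sign).

```python
import math
import random


def gaussian_matrix(d, rng):
    """Sample a d x d matrix with i.i.d. N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]


def cross_polytope_hash(A, x):
    """Return (i, s) such that s * e_i is the point of {±e_i} closest to Ax/||Ax||."""
    y = [sum(a * v for a, v in zip(row, x)) for row in A]
    # Normalizing y would not change which coordinate is largest, so it is skipped.
    i = max(range(len(y)), key=lambda j: abs(y[j]))
    return i, (1 if y[i] >= 0 else -1)


def collision_rate(p, q, trials, seed=0):
    """Empirical Pr[h(p) = h(q)] over independently sampled hash functions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        A = gaussian_matrix(len(p), rng)
        hits += cross_polytope_hash(A, p) == cross_polytope_hash(A, q)
    return hits / trials


if __name__ == "__main__":
    d = 16
    p = [1.0] + [0.0] * (d - 1)
    near = [math.cos(0.2), math.sin(0.2)] + [0.0] * (d - 2)
    far = [0.0, 1.0] + [0.0] * (d - 2)
    print(collision_rate(p, near, 300), collision_rate(p, far, 300))
```

The closest point is found by a single argmax because, for a unit vector y, $\|y - s e_i\|^2 = 2 - 2 s y_i$, which is minimized by the coordinate of largest absolute value with the matching sign.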

The following theorem bounds the collision probability for two points under the above family $\mathcal{H}$.

Theorem 1. Suppose that $p, q \in S^{d-1}$ are such that $\|p - q\| = \tau$, where $0 < \tau < 2$. Then,

$$\ln \frac{1}{\Pr_{h \sim \mathcal{H}}[h(p) = h(q)]} = \frac{\tau^2}{4 - \tau^2} \cdot \ln d + O_{\tau}(\ln \ln d).$$

Before we show how to prove this theorem, we briefly describe its implications. Theorem 1 shows 
that the cross-polytope LSH achieves essentially the same bounds on the collision probabilities as 
the (theoretically) optimal LSH for the sphere from [2] (see Section “Spherical LSH” there). In 
particular, substituting the bounds from Theorem 1 for the cross-polytope LSH into the standard 
reduction from Near Neighbor Search to LSH [6], we obtain the following data structure with 
sub-quadratic space and sublinear query time for Near Neighbor Search on a sphere. 

Corollary 1. The (c, r)-ANN on a unit sphere $S^{d-1}$ can be solved in space $O(n^{1+\rho} + dn)$ and query
time $O(d \cdot n^{\rho})$, where $\rho = \frac{1}{c^2} \cdot \frac{4 - c^2 r^2}{4 - r^2} + o(1)$.
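As a quick sanity check of Corollary 1, the sketch below evaluates the leading term of $\rho$ (dropping the $o(1)$ term) and confirms that for $r = \sqrt{2}/c$ it reduces to $1/(2c^2 - 1)$, the exponent quoted for Spherical LSH in the introduction. The function name `cross_polytope_rho` is hypothetical.

```python
import math


def cross_polytope_rho(c: float, r: float) -> float:
    """Leading term of rho in Corollary 1: (1/c^2) * (4 - c^2 r^2) / (4 - r^2)."""
    if not (c > 1.0 and r > 0.0 and c * r < 2.0):
        raise ValueError("expected c > 1 and 0 < r < c * r < 2")
    return (4.0 - (c * r) ** 2) / (c * c * (4.0 - r * r))


if __name__ == "__main__":
    for c in (1.5, 2.0, 3.0):
        r = math.sqrt(2.0) / c
        # The two printed values agree up to rounding.
        print(c, cross_polytope_rho(c, r), 1.0 / (2.0 * c * c - 1.0))
```

For c = 2 and $r = \sqrt{2}/2$ this gives $\rho = 1/7$, i.e., a query time of roughly $n^{1/7}$.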

We now outline the proof of Theorem 1. For the full proof, see Appendix B. 

Due to the spherical symmetry of Gaussians, we can assume that $p = e_1$ and $q = \alpha e_1 + \beta e_2$,
where $\alpha, \beta$ are such that $\alpha^2 + \beta^2 = 1$ and $(\alpha - 1)^2 + \beta^2 = \tau^2$, with $\tau = \|p - q\|$. Then, we expand the collision
probability:

$$\begin{aligned}
\Pr_{h \sim \mathcal{H}}[h(p) = h(q)] &= 2d \cdot \Pr_{h \sim \mathcal{H}}[h(p) = h(q) = e_1] \\
&= 2d \cdot \Pr_{u, v \sim N(0,1)^d}\left[\forall i \;\; |u_i| \le u_1 \text{ and } |\alpha u_i + \beta v_i| \le \alpha u_1 + \beta v_1\right] \\
&= 2d \cdot \mathbb{E}_{X_1, Y_1}\left[\Pr_{X_2, Y_2}\left[|X_2| \le X_1 \text{ and } |\alpha X_2 + \beta Y_2| \le \alpha X_1 + \beta Y_1\right]^{d-1}\right], \qquad (1)
\end{aligned}$$

where $X_1, Y_1, X_2, Y_2 \sim N(0,1)$. Indeed, the first step is due to the spherical symmetry of the hash
family, the second step follows from the above discussion about replacing a random orthogonal
matrix with a Gaussian one and that one can assume w.l.o.g. that $p = e_1$ and $q = \alpha e_1 + \beta e_2$; the
last step is due to the independence of the entries of u and v.

Thus, proving Theorem 1 reduces to estimating the right-hand side of (1). Note that the
probability $\Pr[|X_2| \le X_1 \text{ and } |\alpha X_2 + \beta Y_2| \le \alpha X_1 + \beta Y_1]$ is equal to the Gaussian area of the planar
set $S_{X_1, Y_1}$ shown in Figure 1a. The latter is heuristically equal to $1 - e^{-\Delta^2/2}$, where $\Delta$ is the
distance from the origin to the complement of $S_{X_1, Y_1}$, which is easy to compute (see Appendix A for
the precise statement of this argument). Using this estimate, we compute (1) by taking the outer
expectation.


3.1 Making the cross-polytope LSH practical 

As described above, the cross-polytope LSH is not quite practical. The main bottleneck is sampling, 
storing, and applying a random rotation. In particular, to multiply a random Gaussian matrix with 
a vector, we need time proportional to $d^2$, which is infeasible for large d.

Figure 1: (a) The set appearing in the analysis of the cross-polytope LSH:
$S_{X_1, Y_1} = \{|x| \le X_1 \text{ and } |\alpha x + \beta y| \le \alpha X_1 + \beta Y_1\}$.
(b) Trade-off between $\rho$ and the number of parts for distances $\sqrt{2}/2$ and $\sqrt{2}$ (approximation c = 2);
both bounds tend to 1/7 (see discussion in Section 4).


Pseudo-random rotations. To rectify this issue, we instead use pseudo-random rotations. Instead
of multiplying an input vector x by a random Gaussian matrix, we apply the following linear
transformation: $x \mapsto H D_3 H D_2 H D_1 x$, where H is the Hadamard transform, and $D_i$ for $i \in \{1, 2, 3\}$
is a random diagonal ±1-matrix. Clearly, this is an orthogonal transformation, which one can
store in space $O(d)$ and evaluate in time $O(d \log d)$ using the Fast Hadamard Transform. This is
similar to pseudo-random rotations used in the context of LSH [21], dimensionality reduction [17],
or compressed sensing [22]. While we are currently not aware how to prove rigorously that such
pseudo-random rotations perform as well as the fully random ones, empirical evaluations show that
three applications of $H D_i$ are exactly equivalent to applying a true random rotation (when d tends
to infinity). We note that only two applications of $H D_i$ are not sufficient.
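The transformation $x \mapsto H D_3 H D_2 H D_1 x$ can be sketched as follows. This is a plain-Python illustration rather than an optimized implementation; it assumes that d is a power of two and that H is normalized by $1/\sqrt{d}$ so that it is orthogonal. The names `fht`, `random_signs` and `pseudo_random_rotation` are hypothetical.

```python
import math
import random


def fht(x):
    """Normalized fast Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine blocks of size h into blocks of size 2h.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def random_signs(d, rng):
    """Diagonal of a random ±1 matrix D_i, stored in O(d) space."""
    return [rng.choice((-1.0, 1.0)) for _ in range(d)]


def pseudo_random_rotation(x, sign_vectors):
    """Apply x -> H D_3 H D_2 H D_1 x, where sign_vectors = [D_1, D_2, D_3]."""
    y = list(x)
    for signs in sign_vectors:
        y = fht([s * v for s, v in zip(signs, y)])
    return y


if __name__ == "__main__":
    rng = random.Random(0)
    d = 8
    signs = [random_signs(d, rng) for _ in range(3)]
    x = [float(i) for i in range(d)]
    y = pseudo_random_rotation(x, signs)
    # An orthogonal map preserves the Euclidean norm.
    print(math.sqrt(sum(v * v for v in x)), math.sqrt(sum(v * v for v in y)))
```

Each of the $\log_2 d$ butterfly passes touches every coordinate once, which gives the $O(d \log d)$ evaluation time, and only the three sign vectors need to be stored.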

Feature hashing. While we can apply a pseudo-random rotation in time $O(d \log d)$, even this
can be too slow. E.g., consider an input vector x that is sparse: the number of non-zero entries
of x is s, much smaller than d. In this case, we can evaluate the hyperplane LSH from [3] in time
$O(s)$, while computing the cross-polytope LSH (even with pseudo-random rotations) still takes time
$O(d \log d)$. To speed up the cross-polytope LSH for sparse vectors, we apply feature hashing [18]:
before performing a pseudo-random rotation, we reduce the dimension from d to $d' \ll d$ by applying
a linear map $x \mapsto Sx$, where S is a random sparse $d' \times d$ matrix, whose columns have one non-zero
±1 entry sampled uniformly. This way, the evaluation time becomes $O(s + d' \log d')$.^3
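A minimal sketch of the map $x \mapsto Sx$ for a sparse input is given below. It stores S implicitly as one row index and one sign per column; the dictionary-based sparse vector format and the names `make_feature_hash` and `apply_feature_hash` are assumptions made for this illustration.

```python
import random


def make_feature_hash(d, d_prime, rng):
    """Sample S implicitly: column j has a single ±1 entry, located in row buckets[j]."""
    buckets = [rng.randrange(d_prime) for _ in range(d)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    return buckets, signs


def apply_feature_hash(sparse_x, buckets, signs, d_prime):
    """Compute Sx for x given as {index: value}; the cost is O(s + d')."""
    out = [0.0] * d_prime
    for j, value in sparse_x.items():
        out[buckets[j]] += signs[j] * value
    return out


if __name__ == "__main__":
    rng = random.Random(1)
    buckets, signs = make_feature_hash(d=100_000, d_prime=512, rng=rng)
    x = {7: 0.6, 4242: -0.8}  # a sparse unit vector with s = 2 non-zero entries
    y = apply_feature_hash(x, buckets, signs, 512)
    print(sum(v * v for v in y))
```

The reduced vector can then be passed to a pseudo-random rotation of dimension d', which accounts for the $d' \log d'$ term in the evaluation time.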

^3 Note that one can apply Lemma 2 from the arXiv version of [18] to claim that, after such a dimension reduction,
the distance between any two points remains sufficiently concentrated for the bounds from Theorem 1 to still hold
(with d replaced by d').

“Partial” cross-polytope LSH. In the above discussion, we defined the cross-polytope LSH as a
hash family that returns the closest neighbor among $\{\pm e_i\}_{1 \le i \le d}$ as a hash (after a (pseudo-)random
rotation). In principle, we do not have to consider all d basis vectors when computing the closest
neighbor. By restricting the hash to $d' \le d$ basis vectors instead, Theorem 1 still holds for the
new hash family (with d replaced by d') since the analysis is essentially dimension-free. This slight
generalization of the cross-polytope LSH turns out to be useful for experiments (see Section 6).
Note that the case d' = 1 corresponds to the hyperplane LSH.

4 Lower bound 

Let $\mathcal{H}$ be a hash family on $S^{d-1}$. For $0 < r_1 < r_2 < 2$ we would like to understand the trade-off
between $p_1$ and $p_2$, where $p_1$ is the smallest probability of collision under $\mathcal{H}$ for points at distance
at most $r_1$ and $p_2$ is the largest probability of collision for points at distance at least $r_2$. We focus
on the case $r_2 \approx \sqrt{2}$ because setting $r_2$ to $\sqrt{2} - o(1)$ (as d tends to infinity) allows us to replace $p_2$
with the following quantity that is somewhat easier to handle:

$$p_2^* = \Pr_{h \sim \mathcal{H},\; u, v \sim S^{d-1}}[h(u) = h(v)].$$

This quantity is at most $p_2 + o(1)$, since the distance between two random points on a unit sphere
$S^{d-1}$ is tightly concentrated around $\sqrt{2}$. So for a hash family $\mathcal{H}$ on a unit sphere $S^{d-1}$ we would
like to understand the upper bound on $p_1$ in terms of $p_2^*$ and $0 < r_1 < \sqrt{2}$.
For $0 \le \tau \le \sqrt{2}$ and $\eta \in \mathbb{R}$, we define

$$\Lambda(\tau, \eta) = \frac{\Pr_{X, Y \sim N(0,1)}\left[X \ge \eta \text{ and } \left(1 - \frac{\tau^2}{2}\right) \cdot X + \sqrt{\tau^2 - \frac{\tau^4}{4}} \cdot Y \ge \eta\right]}{\Pr_{X \sim N(0,1)}[X \ge \eta]}.$$

We are now ready to formulate the main result of this section.

Theorem 2. Let $\mathcal{H}$ be a hash family on $S^{d-1}$ such that every function in $\mathcal{H}$ partitions the sphere
into at most T parts of measure at most 1/2. Then we have $p_1 \le \Lambda(r_1, \eta) + o(1)$, where $\eta \in \mathbb{R}$ is
such that $\Phi_c(\eta) = p_2^*$ and $o(1)$ is a quantity that depends on T and $r_1$ and tends to 0 as d tends to
infinity.

The idea of the proof is first to reason about one part of the partition using the isoperimetric
inequality from [23], and then to apply a certain averaging argument by proving concavity of a
function related to $\Lambda$ using a delicate analytic argument. For the full proof, see Appendix C.

We note that the above requirement of all parts induced by $\mathcal{H}$ having measure at most 1/2 is
only a technicality. We conjecture that Theorem 2 holds without this restriction. In any case, as we
will see below, in the interesting range of parameters this restriction is essentially irrelevant.

One can observe that if every hash function in $\mathcal{H}$ partitions the sphere into at most T parts,
then $p_2^* \ge 1/T$ (indeed, $p_2^*$ is precisely the average sum of squares of measures of the parts). This
observation, combined with Theorem 2, leads to the following interesting consequence. Specifically,
we can numerically estimate $\Lambda$ in order to give a lower bound on $\rho = \frac{\log(1/p_1)}{\log(1/p_2)}$ for any hash family $\mathcal{H}$
in which every function induces at most T parts of measure at most 1/2. See Figure 1b, where we plot
this lower bound for $r_1 = \sqrt{2}/2$,^4 together with an upper bound that is given by the cross-polytope
LSH^5 (for which we use numerical estimates for (1)). We can make several conclusions from this
plot. First, the cross-polytope LSH gives an almost optimal trade-off between $\rho$ and T. Given that
the evaluation time for the cross-polytope LSH is $O(T \log T)$ (if one uses pseudo-random rotations),
we conclude that in order to improve upon the cross-polytope LSH substantially in practice, one
should design an LSH family with $\rho$ being close to optimal and evaluation time that is sublinear in
T. We note that none of the known LSH families for a sphere has been shown to have this property.
This direction looks especially interesting since the convergence of $\rho$ to the optimal value (as T
tends to infinity) is extremely slow (for instance, according to Figure 1b, for $r_1 = \sqrt{2}/2$ and $r_2 \approx \sqrt{2}$
we need more than $10^5$ parts to achieve $\rho \le 0.2$, whereas the optimal $\rho$ is $1/7 \approx 0.143$).
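The quantities entering Theorem 2 are straightforward to estimate numerically. The sketch below computes the Gaussian tail $\Phi_c$ in closed form and estimates $\Lambda$ by plain Monte Carlo sampling. It assumes that $\Lambda(\tau, \eta)$ is the ratio $\Pr[X \ge \eta \text{ and } \alpha X + \beta Y \ge \eta] / \Pr[X \ge \eta]$ with $\alpha = 1 - \tau^2/2$ and $\beta = \sqrt{\tau^2 - \tau^4/4}$; the names `phi_c` and `lambda_mc` are hypothetical, and this is not the numerical procedure used to produce Figure 1b.

```python
import math
import random


def phi_c(eta: float) -> float:
    """Gaussian tail Pr[X >= eta] for X ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(eta / math.sqrt(2.0))


def lambda_mc(tau: float, eta: float, samples: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of Lambda(tau, eta) under the assumed definition above."""
    rng = random.Random(seed)
    alpha = 1.0 - tau * tau / 2.0
    beta = math.sqrt(max(tau * tau - tau ** 4 / 4.0, 0.0))
    tail = both = 0
    for _ in range(samples):
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(0.0, 1.0)
        if x >= eta:
            tail += 1
            if alpha * x + beta * y >= eta:
                both += 1
    return both / tail if tail else float("nan")


if __name__ == "__main__":
    # With at most T parts we have p2* >= 1/T; a few sample points for eta:
    for eta in (0.0, 1.0, 2.0):
        print(eta, phi_c(eta), lambda_mc(math.sqrt(2.0) / 2.0, eta))
```

For large $\eta$ (small $p_2^*$) very few samples land in the tail, so a practical estimate would replace the naive sampling by numerical integration; the sketch is only meant to make the definitions concrete.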

5 Multiprobe LSH for the cross-polytope LSH 

We now describe our multiprobe scheme for the cross-polytope LSH, which is a method for reducing 
the number of independent hash tables in an LSH data structure. Given a query point q, a “standard”
LSH data structure considers only a single cell in each of the L hash tables (the cell is given by the
hash value $h_i(q)$ for $i \in [L]$). In multiprobe LSH, we consider candidates from multiple cells in each
table [14]. The rationale is the following: points p that are close to q but fail to collide with q under
hash function $h_i$ are still likely to hash to a value that is close to $h_i(q)$. By probing multiple hash
locations close to $h_i(q)$ in the same table, multiprobe LSH achieves a given probability of success
with a smaller number of hash tables than “standard” LSH. Multiprobe LSH has been shown to
perform well in practice [14, 24].

The main ingredient in multiprobe LSH is a probing scheme for generating and ranking possible
modifications of the hash value $h_i(q)$. The probing scheme should be computationally efficient and
ensure that more likely hash locations are probed first. For a single cross-polytope hash, the order
of alternative hash values is straightforward: let x be the (pseudo-)randomly rotated version of
query point q. Recall that the “main” hash value is $h_i(q) = \arg\max_{j \in [d]} |x_j|$.^6 Then it is easy to see
that the second highest probability of collision is achieved for the hash value corresponding to the
coordinate with the second largest absolute value, etc. Therefore, we consider the indices $i \in [d]$
sorted by their absolute value as our probing sequence or “ranking” for a single cross-polytope.
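For a single cross-polytope, the ranking described above amounts to sorting the coordinates of the rotated query by decreasing absolute value. A minimal sketch follows; the name `probe_order` is hypothetical, and the code uses the simplified hash that maps $+e_j$ and $-e_j$ to the same value.

```python
def probe_order(x):
    """Indices of the rotated query x, sorted by decreasing |x_j|.

    The first index is the "main" hash value; the following ones are the
    alternative probing locations in order of decreasing collision probability.
    """
    return sorted(range(len(x)), key=lambda j: -abs(x[j]))


if __name__ == "__main__":
    print(probe_order([0.1, -0.9, 0.4, 0.05]))
```

Sorting all d coordinates costs $O(d \log d)$ per hash function, which is the cost that the incremental sorting approach mentioned at the end of this section is meant to avoid.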

^4 The situation is qualitatively similar for other values of $r_1$.

^5 More specifically, for the “partial” version from Section 3.1, since T should be constant, while d grows.

^6 In order to simplify notation, we consider a slightly modified version of the cross-polytope LSH that maps both
the standard basis vector $+e_j$ and its opposite $-e_j$ to the same hash value. It is easy to extend the multiprobe scheme
defined here to the “full” cross-polytope LSH from Section 3.

The remaining question is how to combine multiple cross-polytope rankings when we have more
than one hash function. As in the analysis of the cross-polytope LSH (see Section 3), we consider
two points $q = e_1$ and $p = \alpha e_1 + \beta e_2$ at distance R. Let $A^{(i)}$ be the i.i.d. Gaussian matrix of hash
function $h_i$, and let $x^{(i)} = A^{(i)} e_1$ be the randomly rotated version of point q. Given $x^{(i)}$, we are
interested in the probability of p hashing to a certain combination of the individual cross-polytope
rankings. More formally, let $r^{(i)}_{v_i}$ be the index of the $v_i$-th largest element of $|x^{(i)}|$, where $v \in [d]^k$
specifies the alternative probing location. Then we would like to compute

$$\Pr_{A^{(1)}, \ldots, A^{(k)}}\left[h_i(p) = r^{(i)}_{v_i} \text{ for all } i \in [k] \;\middle|\; A^{(i)} q = x^{(i)}\right]
= \prod_{i=1}^{k} \Pr_{A^{(i)}}\left[\arg\max_{j \in [d]} \left|(\alpha \cdot A^{(i)} e_1 + \beta \cdot A^{(i)} e_2)_j\right| = r^{(i)}_{v_i} \;\middle|\; A^{(i)} e_1 = x^{(i)}\right].$$

If we knew this probability for all $v \in [d]^k$, we could sort the probing locations by their probability.
We now show how to approximate this probability efficiently for a single value of i (and hence drop
the superscripts to simplify notation). WLOG, we permute the rows of A so that $r_v = v$ and get

$$\Pr_{A}\left[\arg\max_{j \in [d]} \left|(\alpha x + \beta \cdot A e_2)_j\right| = v \;\middle|\; A e_1 = x\right]
= \Pr_{y \sim N(0, I_d)}\left[\arg\max_{j \in [d]} \left|\left(x + \tfrac{\beta}{\alpha} \cdot y\right)_j\right| = v\right].$$

The RHS is the Gaussian measure of the set $S = \{y \in \mathbb{R}^d \mid \arg\max_{j \in [d]} |(x + \frac{\beta}{\alpha} y)_j| = v\}$. Similar to
the analysis of the cross-polytope LSH, we approximate the measure of S by its distance to the
origin. Then the probability of probing location v is proportional to $\exp(-\|y_{x,v}\|^2)$, where $y_{x,v}$ is the
shortest vector y such that $\arg\max_j |x + y|_j = v$. Note that the factor $\beta/\alpha$ becomes a proportionality
constant, and hence the probing scheme does not need to know the distance R. For computational
performance and simplicity, we make a further approximation and use $y_{x,v} = (\max_i |x_i| - |x_v|) \cdot e_v$,
i.e., we only consider modifying a single coordinate to reach the set S.

Once we have estimated the probabilities for each $v_i \in [d]$, we incrementally construct the
probing sequence using a binary heap, similar to the approach in [14]. For a probing sequence of
length m, the resulting algorithm has running time $O(L \cdot d \log d + m \log m)$. In our experiments, we
found that the $O(L \cdot d \log d)$ time taken to sort the probing candidates $v_i$ dominated the running
time of the hash function evaluation. In order to circumvent this issue, we use an incremental
sorting approach that only sorts the relevant parts of each cross-polytope and gives a running time
of $O(L \cdot d + m \log m)$.
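The heap-based construction can be sketched as follows for a single table with k cross-polytopes. This is a simplified illustration, not the implementation evaluated in Section 6: it scores a probing location by the sum over cross-polytopes of $(\max_i |x_i| - |x_v|)^2$ (the negative log of the approximate probability, up to constants), fully sorts every cross-polytope instead of sorting incrementally, and uses the hypothetical name `probing_sequence`.

```python
import heapq


def probing_sequence(rotated, m):
    """Return up to m probing locations, most likely first.

    rotated: list of k lists; rotated[i] is the rotated query x^(i).
    Each location is a tuple holding one coordinate index per cross-polytope.
    """
    orders, costs = [], []
    for x in rotated:
        order = sorted(range(len(x)), key=lambda j: -abs(x[j]))
        top = abs(x[order[0]])
        orders.append(order)
        # Cost of moving to each rank; nondecreasing in the rank by construction.
        costs.append([(top - abs(x[j])) ** 2 for j in order])

    k = len(rotated)
    start = (0,) * k                 # rank 0 everywhere: the "main" hash value
    heap = [(0.0, start)]
    seen = {start}
    result = []
    while heap and len(result) < m:
        cost, ranks = heapq.heappop(heap)
        result.append(tuple(orders[i][r] for i, r in enumerate(ranks)))
        # Successors: advance the rank in exactly one cross-polytope.
        for i in range(k):
            if ranks[i] + 1 < len(orders[i]):
                nxt = ranks[:i] + (ranks[i] + 1,) + ranks[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    new_cost = cost - costs[i][ranks[i]] + costs[i][ranks[i] + 1]
                    heapq.heappush(heap, (new_cost, nxt))
    return result


if __name__ == "__main__":
    rotated = [[0.9, 0.1, -0.8], [0.2, -0.7, 0.1]]
    print(probing_sequence(rotated, 4))
```

Because the per-polytope costs are nondecreasing in the rank, every successor costs at least as much as its parent, so popping from the heap enumerates the locations in order of nondecreasing total cost.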

6 Experiments 

We now show that the cross-polytope LSH, combined with our multiprobe extension, leads to an 
algorithm that is also efficient in practice and improves over the hyperplane LSH on several data 
sets. The focus of our experiments is the query time for an exact nearest neighbor search. Since 
hyperplane LSH has been compared to other nearest-neighbor algorithms before [8], we limit our 
attention to the relative speed-up compared with hyperplane hashing. 

We evaluate the two hashing schemes on three types of data sets. We use a synthetic data set of 
randomly generated points because this allows us to vary a single problem parameter while keeping 
the remaining parameters constant. We also investigate the performance of our algorithm on real 
data: two tf-idf data sets [25] and a set of SIFT feature vectors [7]. We have chosen these data sets
in order to illustrate when the cross-polytope LSH gives large improvements over the hyperplane
LSH, and when the improvements are more modest. See Appendix D for a more detailed description 
of the data sets and our experimental setup (implementation details, CPU, etc.). 

In all experiments, we set the algorithm parameters so that the empirical probability of successfully
finding the exact nearest neighbor is at least 0.9. Moreover, we set the number of LSH tables
L so that the amount of additional memory occupied by the LSH data structure is comparable
to the amount of memory necessary for storing the data set. We believe that this is the most
interesting regime because significant memory overheads are often impossible for large data sets.
In order to determine the parameters that are not fixed by the above constraints, we perform a 
grid search over the remaining parameter space and report the best combination of parameters. For 
the cross-polytope hash, we consider “partial” cross-polytopes in the last of the k hash functions in 
order to get a smooth trade-off between the various parameters (see Section 3.1). 

Multiprobe experiments. In order to demonstrate that the multiprobe scheme is critical for
making the cross-polytope LSH competitive with hyperplane hashing, we compare the performance
of a “standard” cross-polytope LSH data structure with our multiprobe variant on an instance of the
random data set ($n = 2^{20}$, d = 128). As can be seen in Table 2 (Appendix D), the multiprobe variant
is about 13× faster in our memory-constrained setting (L = 10). Note that in all of the following
experiments, the speed-up of the multiprobe cross-polytope LSH compared to the multiprobe
hyperplane LSH is less than 11×. Hence without our multiprobe addition, the cross-polytope LSH
would be slower than the hyperplane LSH, for which a multiprobe scheme is already known [14].

Experiments on random data. Next, we show that the better time complexity of the cross-polytope
LSH already applies for moderate values of n. In particular, we compare the cross-polytope
LSH, combined with fast rotations (Section 3.1) and our multiprobe scheme, to a multiprobe
hyperplane LSH on random data. We keep the dimension d = 128 and the distance to the nearest
neighbor $R = \sqrt{2}/2$ fixed, and vary the size of the data set from $2^{20}$ to $2^{28}$. The number of
hash tables L is set to 10. For $2^{20}$ points, the cross-polytope LSH is already 3.5× faster than the
hyperplane LSH, and for $n = 2^{28}$ the speed-up is 10.3× (see Table 3 in Appendix D). Compared to a
linear scan, the speed-up achieved by the cross-polytope LSH ranges from 76× for $n = 2^{20}$ to about
700× for $n = 2^{28}$.

Experiments on real data. On the SIFT data set ($n = 10^6$ and d = 128), the cross-polytope
LSH achieves a modest speed-up of 1.2× compared to the hyperplane LSH (see Table 1). On the
other hand, the speed-up is 3-4× on the two tf-idf data sets, which is a significant improvement
considering the relatively small size of the NYT data set ($n \approx 300{,}000$). One important difference
between the data sets is that the typical distance to the nearest neighbor is smaller in the SIFT
data set, which can make the nearest neighbor problem easier (see Appendix D). Since the tf-idf
data sets are very high-dimensional but sparse ($d \ge 100{,}000$), we use the feature hashing approach
described in Section 3.1 in order to reduce the hashing time of the cross-polytope LSH (the standard
hyperplane LSH already runs in time proportional to the sparsity of a vector). We use 512 and
2048 as feature hashing dimensions for NYT and pubmed, respectively.

Data set   Method   Query time (ms)   Speed-up vs HP   Best k    Number of candidates   Hashing time (ms)   Distances time (ms)
NYT        HP       120               -                19        57,200                 16                  96
NYT        CP       35                3.4×             2 (64)    17,900                 3.0                 30
pubmed     HP       857               -                20        1,480,000              36                  762
pubmed     CP       213               4.0×             2 (512)   304,000                18                  168
SIFT       HP       3.7               -                30        18,628                 0.2                 3.0
SIFT       CP       3.1               1.2×             6 (1)     13,000                 0.6                 2.2

Table 1: Average running times for a single nearest neighbor query with the hyperplane (HP) and
cross-polytope (CP) algorithms on three real data sets. The cross-polytope LSH is faster than the
hyperplane LSH on all data sets, with significant speed-ups for the two tf-idf data sets NYT and
pubmed. For the cross-polytope LSH, the entries for k include both the number of individual hash
functions per table and (in parentheses) the dimension of the last of the k cross-polytopes.


A Gaussian measure of a planar set

In this Section we formalize the intuition that the standard Gaussian measure of a closed subset 

A C behaves like where is the distance from the origin to A, unless A is quite special. 

For a closed subset A C and r > 0 denote 0 < PAi'f') < 1 the normalized measure of the 

intersection A n rS^ (A with the circle centered in the origin and of radius r): 

n{AnrS^) _ 

27rr ’ 
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here fx is the standard one-dimensional Lebesgue measure (see Figure 2a). Denote := inf{r > 
0 : iXAif) > 0} the (essential) distance from the origin to A. Let Q(A) be the standard Gaussian 
measure of A. 

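Lemma 1. Suppose that $A \subseteq \mathbb{R}^2$ is a closed set such that $\mu_A(r)$ is non-decreasing. Then, 

\[ \sup_{r > 0} \left( \mu_A(r) \cdot e^{-r^2/2} \right) \le \mathcal{G}(A) \le e^{-\Delta_A^2/2}. \]

Proof. For the upper bound, we note that 

\[ \mathcal{G}(A) = \int_0^\infty \mu_A(r) \cdot r e^{-r^2/2} \, dr \le \int_{\Delta_A}^\infty r e^{-r^2/2} \, dr = e^{-\Delta_A^2/2}. \]

For the lower bound, we similarly have, for every $r^* > 0$, 

\[ \mathcal{G}(A) = \int_0^\infty \mu_A(r) \cdot r e^{-r^2/2} \, dr \ge \mu_A(r^*) \cdot \int_{r^*}^\infty r e^{-r^2/2} \, dr = \mu_A(r^*) \cdot e^{-(r^*)^2/2}, \]

where we use that $\mu_A(r)$ is non-decreasing. □ 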
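
As a sanity check of Lemma 1, the two bounds can be evaluated numerically for a simple set. The sketch below assumes the half-plane $A = \{(x, y) : x \ge \Delta\}$ with $\Delta > 0$ (an example chosen here, not taken from the paper), for which $\mu_A(r) = \arccos(\Delta/r)/\pi$ for $r \ge \Delta$ is non-decreasing and the Gaussian measure is the one-dimensional tail $\Pr[X \ge \Delta]$. The supremum is approximated on a finite grid, so the computed lower bound can only underestimate the true one.

```python
import math


def gaussian_measure_half_plane(delta):
    """Standard Gaussian measure of {x >= delta}: the one-dimensional tail."""
    return 0.5 * math.erfc(delta / math.sqrt(2.0))


def mu_half_plane(delta, r):
    """Fraction of the circle of radius r that lies inside {x >= delta}."""
    if r <= delta:
        return 0.0
    return math.acos(delta / r) / math.pi


def lemma1_bounds(delta, r_max=10.0, steps=20_000):
    """Grid approximation of the lower bound and the exact upper bound of Lemma 1."""
    lower = 0.0
    for i in range(1, steps + 1):
        r = delta + (r_max - delta) * i / steps
        lower = max(lower, mu_half_plane(delta, r) * math.exp(-r * r / 2.0))
    upper = math.exp(-delta * delta / 2.0)
    return lower, upper


for delta in (0.5, 1.0, 2.0):
    lower, upper = lemma1_bounds(delta)
    print(delta, lower, gaussian_measure_half_plane(delta), upper)
```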

Now we derive two corollaries of Lemma 1. 

Lemma 2. Let $K \subseteq \mathbb{R}^2$ be the complement of an open convex subset of the plane that is symmetric 
around the origin. Then, for every $0 < \varepsilon < 1/3$, 

\[ \Omega\left( \varepsilon^{1/2} \cdot e^{-(1+\varepsilon) \cdot \Delta_K^2/2} \right) \le \mathcal{G}(K) \le e^{-\Delta_K^2/2}. \]

Proof. This follows from Lemma 1: indeed, due to the convexity of the complement of $K$, $\mu_K(r)$ is 
non-decreasing. It is easy to check that 

\[ \mu_K((1+\varepsilon)\Delta_K) = \Omega(\varepsilon^{1/2}), \]

again, due to the convexity (see Figure 2b). Thus, the required bounds follow. □ 

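Lemma 3. Let $K$ be an intersection of two closed half-planes such that: 

• $K$ does not contain a line; 

• the “corner” of $K$ is the closest point of $K$ to the origin; 

• the angle $\alpha$ between the half-planes satisfies $0 < \alpha < \pi$. 

Then, for every $0 < \varepsilon < 1/2$, 

\[ \Omega_\alpha\left( \varepsilon \cdot e^{-(1+\varepsilon) \cdot \Delta_K^2/2} \right) \le \mathcal{G}(K) \le e^{-\Delta_K^2/2}. \]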
B Proof of Theorem 1 


In this section we complete the proof of Theorem 1, following the outline from Section 3. Our 
starting point is the collision probability bound from Eqn. (1). 

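For $u, v \in \mathbb{R}$ with $u \ge 0$ and $\alpha u + \beta v \ge 0$ define 

\[ \sigma(u, v) = \Pr_{X_2, Y_2 \sim N(0,1)} \left[ |X_2| \le u \text{ and } |\alpha X_2 + \beta Y_2| \le \alpha u + \beta v \right]. \]

Then, the right-hand side of (1) is equal to 

\[ 2d \cdot \mathbb{E}_{X_1, Y_1 \sim N(0,1)} \left[ \sigma(X_1, Y_1)^{d-1} \right]. \]

Let us define 

\[ \Delta(u, v) = \min\{u, \alpha u + \beta v\}. \]

Lemma 4. For every $0 < \varepsilon < 1/3$, 

\[ 1 - e^{-\Delta(u,v)^2/2} \le \sigma(u, v) \le 1 - \Omega\left( \varepsilon^{1/2} \cdot e^{-(1+\varepsilon) \cdot \Delta(u,v)^2/2} \right). \]

Proof. This is a combination of Lemma 2 together with the following obvious observation: the 
distance from the origin to the set $\{(x, y) : |x| \ge u \text{ or } |\alpha x + \beta y| \ge \alpha u + \beta v\}$ is equal to $\Delta(u, v)$ (see 
Figure 1a). □ 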


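Lemma 5. For every $t \ge 0$ and $0 < \varepsilon < 1/3$, 

\[ \Omega_\tau\left( \varepsilon \cdot e^{-(1+\varepsilon) \cdot \frac{4}{4-\tau^2} \cdot \frac{t^2}{2}} \right) \le \Pr_{X_1, Y_1 \sim N(0,1)} \left[ \Delta(X_1, Y_1) \ge t \right] \le e^{-\frac{4}{4-\tau^2} \cdot \frac{t^2}{2}}. \]

Proof. Similar to the previous lemma, this is a consequence of Lemma 3 together with the fact 
that the squared distance from the origin to the set $\{(x, y) : x \ge t \text{ and } \alpha x + \beta y \ge t\}$ is equal to 
$\frac{4}{4-\tau^2} \cdot t^2$. □ 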
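
The squared distance used in the proof of Lemma 5 can be checked with a few lines of arithmetic. The sketch below assumes the parametrization $\alpha = 1 - \tau^2/2$ and $\beta = \sqrt{\tau^2 - \tau^4/4}$ (so that $\alpha^2 + \beta^2 = 1$), computes the corner of the wedge $\{x \ge t,\ \alpha x + \beta y \ge t\}$, and compares its squared norm with $4t^2/(4 - \tau^2)$. The function names are illustrative only.

```python
import math


def alpha_beta(tau):
    """Assumed parametrization: alpha = 1 - tau^2/2, beta = sqrt(tau^2 - tau^4/4)."""
    alpha = 1.0 - tau * tau / 2.0
    beta = math.sqrt(tau * tau - tau ** 4 / 4.0)
    return alpha, beta


def corner_squared_distance(tau, t):
    """Squared norm of the point where x = t meets alpha*x + beta*y = t."""
    alpha, beta = alpha_beta(tau)
    x = t
    y = (t - alpha * t) / beta
    return x * x + y * y


for tau in (0.5, 1.0, 1.3):
    t = 2.0
    print(tau, corner_squared_distance(tau, t), 4.0 * t * t / (4.0 - tau * tau))
```

The agreement follows from $1 + (1-\alpha)^2/\beta^2 = 2/(1+\alpha) = 4/(4-\tau^2)$.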


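B.1 Idealized proof 

Let us expand Eqn. (1) further, assuming that the “idealized” versions of Lemma 4 and Lemma 5 
hold. Namely, we assume that 

\[ \sigma(u, v) = 1 - e^{-\Delta(u,v)^2/2} \qquad (2) \]

and 

\[ \Pr_{X_1, Y_1 \sim N(0,1)} \left[ \Delta(X_1, Y_1) \ge t \right] = e^{-\frac{4}{4-\tau^2} \cdot \frac{t^2}{2}}. \qquad (3) \]

In the next section we redo the computations using the precise bounds for $\sigma(u, v)$ and $\Pr[\Delta(X_1, Y_1) \ge t]$. 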
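Expanding Eqn. (1), we have 

\[ \begin{aligned}
\mathbb{E}_{X_1, Y_1 \sim N(0,1)} \left[ \sigma(X_1, Y_1)^{d-1} \right]
&= \int_0^1 \Pr_{X_1, Y_1 \sim N(0,1)} \left[ \sigma(X_1, Y_1) \ge t^{\frac{1}{d-1}} \right] dt \\
&= \int_0^1 \Pr_{X_1, Y_1 \sim N(0,1)} \left[ e^{-\Delta(X_1, Y_1)^2/2} \le 1 - t^{\frac{1}{d-1}} \right] dt \\
&= \int_0^1 \left( 1 - t^{\frac{1}{d-1}} \right)^{\frac{4}{4-\tau^2}} dt \\
&= (d-1) \cdot \int_0^1 (1 - u)^{\frac{4}{4-\tau^2}} u^{d-2} \, du \\
&= (d-1) \cdot B\left( \frac{8-\tau^2}{4-\tau^2};\, d-1 \right) \\
&= \Theta_\tau(1) \cdot d^{-\frac{4}{4-\tau^2}}, \qquad (4)
\end{aligned} \]

where: 

• the first step is a standard expansion of an expectation; 

• the second step is due to (2); 

• the third step is due to (3); 

• the fourth step is a change of variables; 

• the fifth step is a definition of the Beta function; 

• the sixth step is due to the Stirling approximation. 

Overall, substituting (4) into (1), we get: 

\[ \ln \frac{1}{\Pr_{h \sim \mathcal{H}} [h(p) = h(q)]} = \frac{\tau^2}{4-\tau^2} \cdot \ln d \pm O_\tau(1). \]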
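
The fifth and sixth steps of the idealized computation can be checked numerically. The sketch below writes $c = 4/(4-\tau^2)$, evaluates $(d-1) \cdot B(c+1, d-1)$ through log-gamma functions, compares it with a midpoint-rule evaluation of the integral $(d-1)\int_0^1 (1-u)^c u^{d-2}\,du$, and verifies that multiplying by $d^{c}$ approaches the constant $\Gamma(c+1)$ as $d$ grows. All function names are illustrative.

```python
import math


def idealized_expectation(d, tau):
    """(d - 1) * B(c + 1, d - 1) with c = 4 / (4 - tau^2), computed in log-space."""
    c = 4.0 / (4.0 - tau * tau)
    log_beta = math.lgamma(c + 1.0) + math.lgamma(d - 1.0) - math.lgamma(c + d)
    return (d - 1.0) * math.exp(log_beta)


def integral_form(d, tau, steps=20_000):
    """Midpoint rule for (d - 1) * integral_0^1 (1 - u)^c * u^(d - 2) du."""
    c = 4.0 / (4.0 - tau * tau)
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        total += (1.0 - u) ** c * u ** (d - 2)
    return (d - 1.0) * total * h


def scaled_expectation(d, tau):
    """idealized_expectation * d^c, which should tend to Gamma(c + 1)."""
    c = 4.0 / (4.0 - tau * tau)
    return idealized_expectation(d, tau) * d ** c


tau = 1.0
c = 4.0 / (4.0 - tau * tau)
print(idealized_expectation(10, tau), integral_form(10, tau))
print(scaled_expectation(10 ** 6, tau), math.gamma(c + 1.0))
```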


B.2 The real proof 

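We now perform the exact calculations, using the bounds (involving $\varepsilon$) from Lemma 4 and Lemma 5. 
We set $\varepsilon = 1/d$ and obtain the following asymptotic statements: 

\[ \sigma(u, v) = 1 - d^{\pm O(1)} \cdot e^{-(1 \pm d^{-1}) \cdot \Delta(u,v)^2/2} \]

and 

\[ \Pr_{X, Y \sim N(0,1)} \left[ \Delta(X, Y) \ge t \right] = d^{\pm O(1)} \cdot e^{-(1 \pm d^{-1}) \cdot \frac{4}{4-\tau^2} \cdot \frac{t^2}{2}}. \]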
Then, we can repeat the “idealized” proof (see Eqn. (4)) verbatim with the new estimates and 
obtain the final form of Theorem 1: 

\[ \ln \frac{1}{\Pr_{h \sim \mathcal{H}} [h(p) = h(q)]} = \frac{\tau^2}{4-\tau^2} \cdot \ln d \pm O_\tau(\ln \ln d). \]

Note the difference in the low-order term between the idealized and the real version. As we argue in 
Section 4, the latter $O_\tau(\ln \ln d)$ is, in fact, tight. 


C Proof of Theorem 2 


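Lemma 6. Let $A \subset S^{d-1}$ be a measurable subset of a sphere with $\mu(A) = \mu_0 \le 1/2$. Then, for 
$0 < \tau < \sqrt{2}$, one has 

\[ \Pr \left[ v \in A \mid u \in A, \|u - v\| \le \tau \right] \le \frac{\Pr_{X, Y \sim N(0,1)} [X \ge \eta \text{ and } \alpha X + \beta Y \ge \eta] + o(1)}{\Pr_{X \sim N(0,1)} [X \ge \eta] + o(1)}, \qquad (5) \]

where: 

• $\alpha = 1 - \frac{\tau^2}{2}$; 

• $\beta = \sqrt{\tau^2 - \frac{\tau^4}{4}}$; 

• $\eta \in \mathbb{R}$ is such that $\Pr_{X \sim N(0,1)} [X \ge \eta] = \mu_0$. 

In particular, if $\mu_0 = \Omega(1)$, then 

\[ \Pr \left[ v \in A \mid u \in A, \|u - v\| \le \tau \right] \le \Lambda(\tau, \Phi_c^{-1}(\mu_0)) + o(1). \]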

Proof. First, the left-hand side of (5) is maximized by a spherical cap of measure $\mu_0$. This follows 
from Theorem 5 of [23]. So, from now on we assume that $A$ is a spherical cap. 

Second, one has 

\[ \begin{aligned}
\Pr [v \in A \mid u \in A, \|u - v\| \le \tau]
&= \Pr [v \in A \mid u \in A, \|u - v\| = \tau \pm o(1)] + o(1) \\
&= \frac{\Pr [u_1 \ge \eta \text{ and } (\alpha \pm o(1)) u_1 + (\beta \pm o(1)) u_2 \ge \eta]}{\Pr [u_1 \ge \eta]} + o(1) \\
&= \frac{\Pr_{X, Y \sim N(0,1)} [X \ge \eta \text{ and } \alpha X + \beta Y \ge \eta] + o(1)}{\Pr_{X \sim N(0,1)} [X \ge \eta] + o(1)},
\end{aligned} \]

where $\eta$ is such that $\Pr [u_1 \ge \eta] = \mu_0$ and: 

• the first step is due to the concentration of measure on the sphere; 

• the second step is expansion of the conditional probability; 

• the third step is due to the fact that an $O(1)$-dimensional projection of the uniform measure on 
a sphere of radius $\sqrt{d}$ in $\mathbb{R}^d$ converges in total variation to a standard Gaussian measure [26]. 

□ 

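Lemma 7. For every $0 < \tau < \sqrt{2}$, the function $\mu \mapsto \Lambda(\tau, \Phi_c^{-1}(\mu))$ is concave for $0 < \mu < 1/2$. 

Proof. Abusing notation, for this proof we denote $\Lambda(\eta) = \Lambda(\tau, \eta)$ and 

\[ I(\eta) = \Pr_{X, Y \sim N(0,1)} [X \ge \eta \text{ and } \alpha X + \beta Y \ge \eta] \]

(that is, $\Lambda(\eta) = I(\eta)/\Phi_c(\eta)$). One has 

\[ \Phi_c'(\eta) = -\frac{e^{-\eta^2/2}}{\sqrt{2\pi}} \quad \text{and} \quad I'(\eta) = -\sqrt{\frac{2}{\pi}} \cdot e^{-\eta^2/2} \cdot \Phi_c\left( \frac{(1-\alpha)\eta}{\beta} \right). \]

Combining, we get 

\[ \Lambda'(\eta) = \frac{e^{-\eta^2/2}}{\sqrt{2\pi}} \cdot \frac{I(\eta) - 2\Phi_c(\eta)\Phi_c\left( \frac{(1-\alpha)\eta}{\beta} \right)}{\Phi_c(\eta)^2} \]

and 

\[ \frac{d\Lambda(\Phi_c^{-1}(\mu))}{d\mu} = \frac{\Lambda'(\eta^*)}{\Phi_c'(\eta^*)} = \frac{2\Phi_c(\eta^*)\Phi_c\left( \frac{(1-\alpha)\eta^*}{\beta} \right) - I(\eta^*)}{\Phi_c(\eta^*)^2} =: \Pi(\eta^*), \]

where $\eta^* = \Phi_c^{-1}(\mu)$. It is sufficient to show that $\Pi(\eta^*)$ is non-decreasing in $\eta^*$ for $\eta^* \ge 0$. 

We have 

\[ \Pi'(\eta) = \sqrt{\frac{2}{\pi}} \cdot \frac{e^{-\eta^2/2}}{\Phi_c(\eta)^3} \cdot \Omega(\eta), \quad \text{where} \quad \Omega(\eta) := 2\Phi_c(\eta)\Phi_c\left( \frac{(1-\alpha)\eta}{\beta} \right) - I(\eta) - \frac{1-\alpha}{\beta} \cdot e^{\frac{\alpha(1-\alpha)}{\beta^2}\eta^2} \cdot \Phi_c(\eta)^2. \]

We need to show that $\Omega(\eta) \ge 0$ for $\eta \ge 0$. We will do this by showing that $\Omega'(\eta) \le 0$ for $\eta \ge 0$ 
and that $\lim_{\eta \to \infty} \Omega(\eta) = 0$. The latter is obvious, so let us show the former. 

\[ \Omega'(\eta) = -\frac{2\alpha(1-\alpha)^2}{\beta^3} \cdot e^{\frac{\alpha(1-\alpha)}{\beta^2}\eta^2} \cdot \Phi_c(\eta)^2 \cdot \eta \le 0 \]

for $\eta \ge 0$. 

□ 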
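
The concavity claimed in Lemma 7 can be probed numerically. The sketch below assumes the definitions used in the proof, namely $\Phi_c(\eta) = \Pr[X \ge \eta]$ and $\Lambda(\tau, \eta) = I(\eta)/\Phi_c(\eta)$, with $\alpha = 1 - \tau^2/2$ and $\beta = \sqrt{\tau^2 - \tau^4/4}$. It evaluates $I(\eta)$ with Simpson's rule and inverts $\Phi_c$ by bisection; the integration range and step counts are arbitrary choices made for this illustration.

```python
import math


def phi_c(eta):
    """Gaussian tail Pr[X >= eta] for X ~ N(0, 1)."""
    return 0.5 * math.erfc(eta / math.sqrt(2.0))


def phi_c_inverse(mu, lo=-10.0, hi=10.0):
    """Invert phi_c by bisection (phi_c is decreasing)."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if phi_c(mid) > mu:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def wedge_probability(tau, eta, steps=4000, x_max=12.0):
    """I(eta) = Pr[X >= eta and alpha*X + beta*Y >= eta] via Simpson's rule in x."""
    alpha = 1.0 - tau * tau / 2.0
    beta = math.sqrt(tau * tau - tau ** 4 / 4.0)

    def integrand(x):
        density = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
        # Given X = x, the second event is Y >= (eta - alpha * x) / beta.
        return density * phi_c((eta - alpha * x) / beta)

    h = (x_max - eta) / steps
    total = integrand(eta) + integrand(x_max)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(eta + i * h)
    return total * h / 3.0


def big_lambda(tau, eta):
    """Lambda(tau, eta) = I(eta) / Phi_c(eta)."""
    return wedge_probability(tau, eta) / phi_c(eta)


def f(mu, tau=1.0):
    return big_lambda(tau, phi_c_inverse(mu))


# Midpoint concavity check on 0 < mu < 1/2.
print(f(0.25), 0.5 * (f(0.05) + f(0.45)))
```

For $\tau = 1$ and $\eta = 0$ the wedge has opening angle $2\pi/3$, so $I(0) = 1/3$ and $\Lambda(1, 0) = 2/3$, which gives a quick check of the integration routine.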


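Now we are ready to prove Theorem 2. Let us first assume that all the parts have measure $\Omega(1)$. 
Later we will show that this assumption can be removed. W.l.o.g. we assume that functions from 
the family have subsets of integers as a range. We have 

\[ \begin{aligned}
p_1 &\le \Pr_{h \sim \mathcal{H}} \left[ h(u) = h(v) \mid \|u - v\| \le \tau \right] \\
&= \mathbb{E}_{h \sim \mathcal{H}} \left[ \sum_i \mu(h^{-1}(i)) \cdot \Pr \left[ v \in h^{-1}(i) \mid u \in h^{-1}(i), \|u - v\| \le \tau \right] \right] \\
&\le \mathbb{E}_{h \sim \mathcal{H}} \left[ \sum_i \mu(h^{-1}(i)) \cdot \Lambda\left( \tau, \Phi_c^{-1}(\mu(h^{-1}(i))) \right) \right] + o(1) \\
&\le \Lambda\left( \tau, \Phi_c^{-1}\left( \mathbb{E}_{h \sim \mathcal{H}} \left[ \sum_i \mu(h^{-1}(i))^2 \right] \right) \right) + o(1) \\
&\le \Lambda\left( \tau, \Phi_c^{-1}(p^*(\mathcal{H})) \right) + o(1),
\end{aligned} \]

where: 

• the first step is by the definition of $p_1$; 

• the third step is due to the condition $\mu(h^{-1}(i)) = \Omega(1)$ and Lemma 6; 

• the fourth step is due to Lemma 7 and the assumption $\mu(h^{-1}(i)) \le 1/2$; 

• the final step is due to the definition of $p^*(\mathcal{H})$. 

To get rid of the assumption that the measure of every part is $\Omega(1)$, observe that all parts with 
measure at most $\varepsilon$ contribute at most $\varepsilon \cdot T$ to the expectation, since there are at most $T$ pieces in 
total. Note that if $\varepsilon = o(1)$, then $\varepsilon \cdot T = o(1)$, since we assume $T$ to be fixed. 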

D Further description of experiments 

In order to compare meaningful running time numbers, we have written fast C++ implementations 
of both the cross-polytope LSH and the hyperplane LSH. This enables a fair comparison since 
both implementations have been optimized by us to the same degree. In particular, hyperplane 
hashing can be implemented efficiently using a matrix-vector multiplication sub-routine for which 
we use the Eigen library (Eigen is also used for all other linear algebra operations). For the fast 
pseudo-random rotation in the cross-polytope LSH, we have written a SIMD-optimized version 
of the Fast Hadamard Transform (FHT). We compiled our code with g++ 4.9 and the -O3 flag. 
All experiments except those in Table 3 ran on an Intel Core i5-2500 CPU (3.3 - 3.7 GHz, 6 MB 
cache) with 8 GB of RAM. Since 8 GB of RAM was too small for the larger values of n, we ran the 
experiments in Table 3 on a machine with an Intel Xeon E5-2690 v2 CPU (3.0 GHz, 25 MB cache) 
and 512 GB of RAM. 
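
For readers unfamiliar with the FHT, the following Python sketch shows the butterfly structure of the transform. It is written for clarity and is not the SIMD-optimized C++ routine described above. The helper `cross_polytope_hash` is an assumption made for illustration: it takes an already rotated vector and returns the signed coordinate of largest magnitude, i.e. the closest vertex of the cross-polytope.

```python
def fast_hadamard_transform(values):
    """Unnormalized Walsh-Hadamard transform of a list whose length is a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Combine blocks of size h pairwise with the (a + b, a - b) butterfly.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def cross_polytope_hash(rotated):
    """Illustrative helper: (index, sign) of the largest-magnitude coordinate."""
    index = max(range(len(rotated)), key=lambda i: abs(rotated[i]))
    sign = 1 if rotated[index] >= 0 else -1
    return index, sign


x = [3, -1, 4, 1, -5, 9, 2, 6]
print(fast_hadamard_transform(x))
print(cross_polytope_hash([0.1, -0.9, 0.3]))
```

The transform uses n log n additions and subtractions instead of the n^2 multiplications of a dense matrix-vector product, and applying it twice returns the input scaled by n.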

In our experiments, we evaluate the performance of the cross-polytope LSH on the following 
data sets. Figure 3 shows the distribution of distances to the nearest neighbor for the four data sets. 
[Figure 3 consists of four panels: Random data set, NYT data set, SIFT data set, and pubmed data set. Horizontal axis: distance to nearest neighbor.] 

Figure 3: Distance to the nearest neighbor for the four data sets used in our experiments. The SIFT 
data set has the closest nearest neighbors. 


random. For the random data sets, we generate a set of n points uniformly at random on the unit 
sphere. In order to generate a query, we pick a random point q' from the data set and generate a 
point at distance R from q' on the unit sphere. In our experiments, we vary the dimension of the 
point set between 128 and 1,024. Experiments with the random data set are useful because we 
can study the impact of various parameters (e.g., the dimension d or the number of points n) 
while keeping the remaining parameters constant. 
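
A minimal sketch of this generation procedure is given below, using only the Python standard library. It assumes that uniform points on the sphere are obtained by normalizing Gaussian vectors and that the query is placed at distance R in a uniformly random direction orthogonal to q'; the paper does not spell out these details, and the function names are illustrative.

```python
import math
import random


def random_unit_vector(d, rng):
    """Uniform point on the unit sphere: a normalized standard Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def point_at_distance(q_prime, R, rng):
    """Unit vector at Euclidean distance R (0 < R < 2) from the unit vector q_prime."""
    g = [rng.gauss(0.0, 1.0) for _ in range(len(q_prime))]
    dot = sum(a * b for a, b in zip(g, q_prime))
    w = [a - dot * b for a, b in zip(g, q_prime)]  # remove the component along q_prime
    norm = math.sqrt(sum(x * x for x in w))
    w = [x / norm for x in w]
    alpha = 1.0 - R * R / 2.0                  # inner product <q, q_prime>
    beta = math.sqrt(R * R - R ** 4 / 4.0)     # so that alpha^2 + beta^2 = 1
    return [alpha * a + beta * b for a, b in zip(q_prime, w)]


rng = random.Random(0)
points = [random_unit_vector(128, rng) for _ in range(100)]
q_prime = rng.choice(points)
q = point_at_distance(q_prime, math.sqrt(2.0) / 2.0, rng)
```

Since the squared distance between two unit vectors is 2 minus twice their inner product, choosing the inner product 1 - R^2/2 places the query at distance exactly R.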

pubmed / NYT. The pubmed and NYT data sets contain bag-of-words representations of medical 
paper abstracts and newspaper articles, respectively [25]. We convert this representation into 
standard tf-idf feature vectors with dimensionality about 100,000. The number of points in the 
pubmed data set is about 8 million, for NYT it is 300,000. Before setting up the LSH data 
structures, we set 1,000 data points aside as query vectors. When selecting query vectors, we limit 
our attention to points for which the inner product with the nearest neighbor is between 0.3 and 
0.8. We believe that this is the most interesting range since near-duplicates (inner product close 
to 1) can be identified more efficiently with other methods, and points without a close nearest 
neighbor (inner product less than 0.3) often do not have a semantically meaningful match. 

SIFT. We use the standard data set of one million SIFT feature vectors from [7], which also contains 
a set of 10,000 query vectors. The SIFT feature vectors have dimension 128 and (approximately) 


live on a sphere. We normalize the feature vectors to unit length but keep the original nearest 
neighbor assignments—this is possible because only a very small fraction of nearest neighbors 
changes through normalization. We include this data set as an example where the speed-up of 
the cross-polytope LSH is more modest. 


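| Method       | k | Last CP dimension | Extra probes | Query time (ms) | Number of candidates | CP hashing time (ms) | Distances time (ms) |
|--------------|---|-------------------|--------------|-----------------|----------------------|----------------------|---------------------|
| Single-probe | 1 | 128               | 0            | 6.7             | 39,800               | 0.01                 | 6.3                 |
| Multiprobe   | 3 | 16                | 896          | 0.51            | 867                  | 0.22                 | 0.16                |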


Table 2: Comparison of “standard” LSH using the cross-polytope (CP) hash vs. our multiprobe 
variant (L = 10 in both cases). On a random data set with n = 2^*^, d = 128, and $R = \sqrt{2}/2$, the 
single-probe scheme requires 13× more time per query. Due to the larger value of k, the multiprobe 
variant performs fewer distance computations, which leads to a better trade-off between the hash 
computation time and the time spent on computing distances to candidates from the hash tables. 


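| Data set size n    | 2^20   | 2^22   | 2^24    | 2^26  | 2^28   |
|--------------------|--------|--------|---------|-------|--------|
| HP query time (ms) | 2.6    | 7.4    | 25      | 63    | 185    |
| CP query time (ms) | 0.75   | 1.4    | 3.1     | 8.8   | 18     |
| Speed-up           | 3.5×   | 5.3×   | 8.1×    | 7.2×  | 10.3×  |
| k for CP           | 3 (16) | 3 (64) | 3 (128) | 4 (2) | 4 (64) |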


Table 3: Average running times for a single nearest neighbor query with the hyperplane (HP) 
and cross-polytope (CP) algorithms on a random data set with d = 128 and R = The 

cross-polytope LSH is up to 10× faster than the hyperplane LSH. The last row of the table indicates 
the optimal choice of k for the cross-polytope LSH and (in parentheses) the dimension of the last of 
the k cross-polytopes; all other cross-polytopes have full dimension 128. Note that the speed-up 
ratio is not monotonically increasing because the cross-polytope LSH performs better for values of 
n where the optimal setting of k uses a last cross-polytope with high dimension. 